V-measure score (v_measure_score)#

V-measure is an external clustering metric: it evaluates a predicted clustering labels_pred against known ground-truth labels labels_true.

It combines two complementary requirements:

  • Homogeneity: each predicted cluster contains only members of a single class (pure clusters)

  • Completeness: all members of a given class are assigned to the same cluster (do not split classes)

V-measure is the (weighted) harmonic mean of homogeneity and completeness, so it is high only when both are high.
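For example, with homogeneity 1.0 and completeness 0.5, the harmonic mean is pulled toward the weaker score:

```python
# Harmonic mean of h and c (the beta=1 case): dominated by the weaker score
h, c = 1.0, 0.5
v = 2 * h * c / (h + c)
print(round(v, 3))  # 0.667 — closer to c than the arithmetic mean (0.75)
```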

Learning goals#

  • Build intuition for homogeneity vs completeness

  • Derive V-measure from entropy / mutual information

  • Implement v_measure_score from scratch in NumPy

  • Use V-measure to tune a simple clustering algorithm (K-means) when labels are available

  • Know pros/cons, pitfalls, and when to use it

Quick import#

from sklearn.metrics import v_measure_score
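A minimal call on toy label arrays (illustrative values only):

```python
from sklearn.metrics import v_measure_score

labels_true = [0, 0, 1, 1]
labels_pred = [1, 1, 0, 0]  # same partition, cluster IDs swapped
print(v_measure_score(labels_true, labels_pred))  # 1.0: only the partition matters
```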

When should you use it?#

Use V-measure when you have ground-truth categories (or a labeled validation set) and want to evaluate or compare clustering results.

If you do not have labels, prefer internal metrics (e.g., silhouette) or task-specific evaluation.


Intuition: merge vs split#

Think of the true labels as colors (classes) and the clustering output as groups (clusters).

  • If a cluster mixes many colors → it is not homogeneous.

  • If a color is scattered across many clusters → it is not complete.

V-measure forces a balance.

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots

from sklearn.metrics import (
    completeness_score,
    homogeneity_score,
    normalized_mutual_info_score,
    v_measure_score,
)

pio.templates.default = 'plotly_white'
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)

rng = np.random.default_rng(42)

Warm-up: four clusterings of the same labeled dataset#

Below we create four different labels_pred arrays for the same labels_true:

  1. Perfect (up to permutation of cluster IDs)

  2. Under-clustering: everything in one cluster (complete but not homogeneous)

  3. Over-clustering: split each class into multiple clusters (homogeneous but not complete)

  4. Random assignment

n_per_class = 60
labels_true = np.repeat(np.arange(3), n_per_class)

# 1) perfect, but with a permutation of cluster IDs
perm_map = {0: 2, 1: 0, 2: 1}
labels_pred_perfect = np.vectorize(perm_map.get)(labels_true)

# 2) under-clustering: everything in one cluster
labels_pred_one_cluster = np.zeros_like(labels_true)

# 3) over-clustering: split each class into 2 clusters
#    class 0 -> clusters 0/1, class 1 -> clusters 2/3, class 2 -> clusters 4/5
split_bit = np.arange(labels_true.size) % 2
labels_pred_split_each_class = labels_true * 2 + split_bit

# 4) random assignment into 3 clusters
labels_pred_random = rng.integers(0, 3, size=labels_true.size)

cases = {
    'perfect (permute ids)': labels_pred_perfect,
    'one cluster (merge classes)': labels_pred_one_cluster,
    'split each class (over-cluster)': labels_pred_split_each_class,
    'random (3 clusters)': labels_pred_random,
}

rows = []
for name, labels_pred in cases.items():
    h = homogeneity_score(labels_true, labels_pred)
    c = completeness_score(labels_true, labels_pred)
    v = v_measure_score(labels_true, labels_pred)
    rows.extend(
        [
            {'case': name, 'metric': 'homogeneity', 'value': h},
            {'case': name, 'metric': 'completeness', 'value': c},
            {'case': name, 'metric': 'v_measure', 'value': v},
        ]
    )

fig = px.bar(
    rows,
    x='case',
    y='value',
    color='metric',
    barmode='group',
    title='Homogeneity vs completeness vs V-measure on simple labelings',
)
fig.update_layout(yaxis=dict(range=[0, 1.05]))
fig.show()

How to read the plot#

  • One cluster: completeness is 1 (each class is fully contained in one cluster), but homogeneity is 0 (the single cluster mixes all classes), so V-measure is 0.

  • Split each class: homogeneity is 1 (each cluster is pure), but completeness is low (each class is spread across clusters).

  • V-measure penalizes both extremes.

Definition (information-theoretic)#

Let:

  • \(C\) be the random variable for the true class label

  • \(K\) be the random variable for the predicted cluster label

  • \(n_{ck}\) be the contingency table counts (how many samples of class \(c\) are assigned to cluster \(k\))

  • \(N = \sum_{c,k} n_{ck}\)

From the contingency table we get marginals:

  • \(n_c = \sum_k n_{ck}\)

  • \(n_k = \sum_c n_{ck}\)

  • \(p(c) = n_c / N\), \(p(k) = n_k / N\), \(p(c,k) = n_{ck} / N\)

Entropy and mutual information#

\[ H(C) = -\sum_c p(c) \log p(c), \quad H(K) = -\sum_k p(k) \log p(k) \]
\[ I(C;K) = \sum_{c,k} p(c,k) \log\frac{p(c,k)}{p(c)p(k)} \]

(Any log base works; V-measure is a ratio, so the base cancels.)
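A quick numeric check of the base-invariance claim, on a small hypothetical contingency table:

```python
import numpy as np

# Hypothetical 2x2 contingency table (all counts positive, so logs stay finite)
n = np.array([[5.0, 1.0],
              [1.0, 5.0]])
p = n / n.sum()
pc, pk = p.sum(axis=1), p.sum(axis=0)

def v1(log):
    """V_1 = 2 I(C;K) / (H(C) + H(K)) computed with a given log function."""
    h_c = -np.sum(pc * log(pc))
    h_k = -np.sum(pk * log(pk))
    mi = np.sum(p * log(p / np.outer(pc, pk)))
    return 2 * mi / (h_c + h_k)

# Natural log and log base 2 give the same ratio: the base cancels
print(np.isclose(v1(np.log), v1(np.log2)))  # True
```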

Homogeneity and completeness#

Homogeneity:

\[ h = 1 - \frac{H(C\mid K)}{H(C)} = \frac{I(C;K)}{H(C)} \]

Completeness:

\[ c = 1 - \frac{H(K\mid C)}{H(K)} = \frac{I(C;K)}{H(K)} \]

Edge cases:

  • If \(H(C)=0\) (only one true class), define \(h=1\).

  • If \(H(K)=0\) (only one predicted cluster), define \(c=1\).

V-measure#

With a weight \(\beta \ge 0\) (bigger \(\beta\) emphasizes completeness):

\[ V_{\beta} = \frac{(1+\beta)\, h\, c}{\beta\, h + c} \]

For \(\beta=1\) it is symmetric and equivalent to normalized mutual information with arithmetic normalization:

\[ V_1 = \frac{2 I(C;K)}{H(C)+H(K)} \]

NumPy implementation (from scratch)#

We will implement everything from the contingency table up.

Notes:

  • We use natural logarithms (np.log), matching scikit-learn.

  • The score is invariant to label permutations: only the contingency table matters.

def _encode_labels(labels):
    labels = np.asarray(labels)
    if labels.ndim != 1:
        raise ValueError('labels must be 1D')
    uniques, inv = np.unique(labels, return_inverse=True)
    return uniques, inv


def contingency_matrix_np(labels_true, labels_pred):
    """Contingency matrix n_{ck} with shape (n_classes, n_clusters)."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)

    if labels_true.shape != labels_pred.shape:
        raise ValueError('labels_true and labels_pred must have the same shape')
    if labels_true.size == 0:
        raise ValueError('empty label arrays')

    classes, class_idx = _encode_labels(labels_true)
    clusters, cluster_idx = _encode_labels(labels_pred)

    cont = np.zeros((classes.size, clusters.size), dtype=np.int64)
    np.add.at(cont, (class_idx, cluster_idx), 1)
    return cont, classes, clusters


def entropy_from_counts(counts):
    """Shannon entropy of a discrete distribution given counts (natural log)."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    if total <= 0:
        return 0.0

    p = counts / total
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))


def mutual_info_from_contingency(cont):
    """Mutual information I(C;K) from contingency matrix (natural log)."""
    cont = np.asarray(cont, dtype=float)
    n = cont.sum()
    if n <= 0:
        return 0.0

    pi = cont.sum(axis=1)  # class marginals
    pj = cont.sum(axis=0)  # cluster marginals

    i_idx, j_idx = np.nonzero(cont)
    n_ij = cont[i_idx, j_idx]
    return float(np.sum((n_ij / n) * np.log((n * n_ij) / (pi[i_idx] * pj[j_idx]))))


def homogeneity_completeness_v_measure_np(labels_true, labels_pred, beta=1.0):
    if beta < 0:
        raise ValueError('beta must be >= 0')

    cont, _, _ = contingency_matrix_np(labels_true, labels_pred)
    h_c = entropy_from_counts(cont.sum(axis=1))
    h_k = entropy_from_counts(cont.sum(axis=0))
    mi = mutual_info_from_contingency(cont)

    homogeneity = 1.0 if h_c == 0.0 else mi / h_c
    completeness = 1.0 if h_k == 0.0 else mi / h_k

    homogeneity = float(np.clip(homogeneity, 0.0, 1.0))
    completeness = float(np.clip(completeness, 0.0, 1.0))

    if homogeneity == 0.0 or completeness == 0.0:
        v = 0.0
    else:
        v = (1.0 + beta) * homogeneity * completeness / (beta * homogeneity + completeness)

    return homogeneity, completeness, float(v)


def v_measure_score_np(labels_true, labels_pred, beta=1.0):
    return homogeneity_completeness_v_measure_np(labels_true, labels_pred, beta=beta)[2]
# Sanity check: compare to scikit-learn on random labelings
def _check_against_sklearn(n_trials=200, n=200, n_classes=5, n_clusters=7):
    for _ in range(n_trials):
        y_true = rng.integers(0, n_classes, size=n)
        y_pred = rng.integers(0, n_clusters, size=n)

        h_np, c_np, v_np = homogeneity_completeness_v_measure_np(y_true, y_pred, beta=1.0)
        h_sk = homogeneity_score(y_true, y_pred)
        c_sk = completeness_score(y_true, y_pred)
        v_sk = v_measure_score(y_true, y_pred)

        if not (
            np.isclose(h_np, h_sk, atol=1e-12, rtol=0)
            and np.isclose(c_np, c_sk, atol=1e-12, rtol=0)
            and np.isclose(v_np, v_sk, atol=1e-12, rtol=0)
        ):
            return False, (h_np, h_sk, c_np, c_sk, v_np, v_sk)

    return True, None


ok, debug = _check_against_sklearn()
ok
True
# V-measure (beta=1) equals normalized mutual information with arithmetic normalization
for name, labels_pred in cases.items():
    v = v_measure_score(labels_true, labels_pred)
    nmi = normalized_mutual_info_score(labels_true, labels_pred, average_method='arithmetic')
    print(f'{name:30s}  v={v:.6f}  nmi(arithmetic)={nmi:.6f}')
perfect (permute ids)           v=1.000000  nmi(arithmetic)=1.000000
one cluster (merge classes)     v=0.000000  nmi(arithmetic)=0.000000
split each class (over-cluster)  v=0.760188  nmi(arithmetic)=0.760188
random (3 clusters)             v=0.005959  nmi(arithmetic)=0.005959

Visualizing what the metric sees: contingency tables#

V-measure only depends on how true classes and predicted clusters overlap.

A perfect clustering produces a contingency table that looks like a permutation of the identity matrix.

  • Under-clustering (merging classes) creates columns with many non-zero rows.

  • Over-clustering (splitting classes) creates rows with many non-zero columns.

fig = make_subplots(
    rows=2,
    cols=2,
    subplot_titles=list(cases.keys()),
    horizontal_spacing=0.08,
    vertical_spacing=0.14,
)

for i, (name, labels_pred) in enumerate(cases.items()):
    cont, classes, clusters = contingency_matrix_np(labels_true, labels_pred)
    r, c = i // 2 + 1, i % 2 + 1
    fig.add_trace(
        go.Heatmap(
            z=cont,
            x=[str(k) for k in clusters],
            y=[str(c_) for c_ in classes],
            colorscale='Blues',
            showscale=i == 0,
            hovertemplate='true=%{y}<br>cluster=%{x}<br>count=%{z}<extra></extra>',
        ),
        row=r,
        col=c,
    )

fig.update_layout(
    title='Contingency tables: true class (rows) × predicted cluster (cols)',
    height=650,
)
fig.update_xaxes(title_text='predicted cluster')
fig.update_yaxes(title_text='true class')
fig.show()

The \(\beta\) parameter: choose which mistake hurts more#

  • Larger \(\beta\) emphasizes completeness (do not split classes).

  • Smaller \(\beta\) emphasizes homogeneity (do not mix classes).

Below, compare how \(V_{\beta}\) changes for a merge-style mistake vs a split-style mistake.

betas = np.logspace(-2, 2, 250)

beta_rows = []
for beta in betas:
    for name, labels_pred in {
        'merge classes (one cluster)': labels_pred_one_cluster,
        'split classes (over-cluster)': labels_pred_split_each_class,
    }.items():
        v = v_measure_score_np(labels_true, labels_pred, beta=float(beta))
        beta_rows.append({'beta': beta, 'case': name, 'v_measure': v})

fig = px.line(
    beta_rows,
    x='beta',
    y='v_measure',
    color='case',
    log_x=True,
    title='Effect of beta on V-measure',
)
fig.add_vline(x=1.0, line_dash='dash', line_color='gray')
fig.update_layout(yaxis=dict(range=[0, 1.05]))
fig.show()

Using V-measure to optimize a simple algorithm (K-means)#

V-measure is not differentiable w.r.t. cluster assignments, so you typically do not optimize it with gradient descent.

Instead, you use it for model selection when labels are available, e.g.:

  • pick the number of clusters \(k\)

  • pick the best initialization / run among many

  • tune algorithm hyperparameters

Below we:

  1. Generate a labeled 2D dataset (three Gaussian blobs)

  2. Run a low-level NumPy K-means implementation for different \(k\)

  3. Choose the \(k\) that maximizes V-measure

centers = np.array([[-2.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
cluster_std = 0.6
n_per_center = 200

X_parts = []
y_parts = []
for i, mu in enumerate(centers):
    X_i = rng.normal(loc=mu, scale=cluster_std, size=(n_per_center, 2))
    y_i = np.full(n_per_center, i)
    X_parts.append(X_i)
    y_parts.append(y_i)

X = np.vstack(X_parts)
y_true = np.concatenate(y_parts)

perm = rng.permutation(X.shape[0])
X = X[perm]
y_true = y_true[perm]

fig = px.scatter(
    x=X[:, 0],
    y=X[:, 1],
    color=y_true.astype(str),
    title='Synthetic dataset (colored by true class)',
    labels={'x': 'x1', 'y': 'x2', 'color': 'true class'},
)
fig.show()
def kmeans_np(X, k, *, n_init=10, max_iter=100, rng=None):
    """A small NumPy K-means implementation (Lloyd's algorithm).

    Returns: (labels, centroids, inertia)
    """
    X = np.asarray(X, dtype=float)
    if X.ndim != 2:
        raise ValueError('X must be 2D')
    if k <= 0:
        raise ValueError('k must be >= 1')

    n_samples = X.shape[0]
    rng = np.random.default_rng(rng)

    best_inertia = np.inf
    best_labels = None
    best_centroids = None

    for _ in range(n_init):
        init_idx = rng.choice(n_samples, size=k, replace=n_samples < k)
        centroids = X[init_idx].copy()

        labels = None
        for _ in range(max_iter):
            d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            new_labels = d2.argmin(axis=1)

            if labels is not None and np.array_equal(new_labels, labels):
                break
            labels = new_labels

            new_centroids = centroids.copy()
            for j in range(k):
                mask = labels == j
                if not np.any(mask):
                    new_centroids[j] = X[rng.integers(0, n_samples)]
                else:
                    new_centroids[j] = X[mask].mean(axis=0)
            centroids = new_centroids

        inertia = float(((X - centroids[labels]) ** 2).sum())
        if inertia < best_inertia:
            best_inertia = inertia
            best_labels = labels.copy()
            best_centroids = centroids.copy()

    return best_labels, best_centroids, best_inertia
results = []
for k in range(2, 9):
    labels_pred, centroids_k, inertia = kmeans_np(X, k, n_init=20, max_iter=200, rng=123)
    h, c, v = homogeneity_completeness_v_measure_np(y_true, labels_pred, beta=1.0)
    results.append(
        {
            'k': k,
            'homogeneity': h,
            'completeness': c,
            'v_measure': v,
            'inertia': inertia,
            'labels_pred': labels_pred,
            'centroids': centroids_k,
        }
    )

best = max(results, key=lambda r: r['v_measure'])
best['k'], best['v_measure']
(3, 1.0)
long = []
for r in results:
    for m in ['homogeneity', 'completeness', 'v_measure']:
        long.append({'k': r['k'], 'metric': m, 'value': r[m]})

fig1 = px.line(
    long,
    x='k',
    y='value',
    color='metric',
    markers=True,
    title='External model selection with labels: maximize V-measure',
)
fig1.update_layout(yaxis=dict(range=[0, 1.05]))

results_summary = [{'k': r['k'], 'inertia': r['inertia']} for r in results]
fig2 = px.line(
    results_summary,
    x='k',
    y='inertia',
    markers=True,
    title='Inertia (K-means objective) decreases with larger k, so it cannot select k',
)

fig = make_subplots(rows=1, cols=2, subplot_titles=[fig1.layout.title.text, fig2.layout.title.text])
for tr in fig1.data:
    fig.add_trace(tr, row=1, col=1)
for tr in fig2.data:
    fig.add_trace(tr, row=1, col=2)

fig.update_layout(height=420, showlegend=True)
fig.update_yaxes(range=[0, 1.05], row=1, col=1)
fig.add_vline(x=best['k'], line_dash='dash', line_color='gray', row=1, col=1)
fig.show()
labels_best = best['labels_pred']
centroids_best = best['centroids']

fig = px.scatter(
    x=X[:, 0],
    y=X[:, 1],
    color=labels_best.astype(str),
    title=f"K-means clustering for k={best['k']} (colored by predicted cluster)",
    labels={'x': 'x1', 'y': 'x2', 'color': 'cluster'},
)
fig.add_trace(
    go.Scatter(
        x=centroids_best[:, 0],
        y=centroids_best[:, 1],
        mode='markers',
        marker=dict(color='black', size=12, symbol='x'),
        name='centroids',
    )
)
fig.show()

Pros, cons, and where it is useful#

Pros#

  • Permutation-invariant (cluster IDs do not matter)

  • Bounded in \([0,1]\) and easy to compare across runs

  • Works when the number of clusters differs from the number of classes

  • Decomposes into two interpretable parts (homogeneity vs completeness)

  • With \(\beta=1\) it equals NMI with arithmetic normalization

Cons / pitfalls#

  • Requires ground-truth labels (external metric)

  • Can be pushed toward extremes:

    • many tiny clusters → high homogeneity

    • one giant cluster → high completeness

  • Ignores geometry/distances: it only evaluates the final label assignments

  • Not differentiable: typically used for evaluation or hyperparameter search, not gradient-based training
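A quick sketch of the two extremes from the list above (singleton clusters vs one giant cluster):

```python
import numpy as np
from sklearn.metrics import completeness_score, homogeneity_score, v_measure_score

labels_true = np.repeat([0, 1, 2], 10)

# Many tiny clusters: every sample alone -> perfectly homogeneous, poorly complete
singletons = np.arange(labels_true.size)
print(homogeneity_score(labels_true, singletons))  # 1.0
print(v_measure_score(labels_true, singletons))    # well below 1

# One giant cluster: perfectly complete, zero homogeneity
giant = np.zeros_like(labels_true)
print(completeness_score(labels_true, giant))  # 1.0
print(v_measure_score(labels_true, giant))     # 0.0
```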

Good use cases#

  • Benchmarking clustering algorithms on labeled datasets

  • Hyperparameter selection when you have a labeled validation set (semi-supervised model selection)

  • Comparing runs/initializations of a clustering method when labels are available

Common diagnostics and pitfalls#

  • Always inspect the contingency matrix: it explains why V-measure is high/low.

  • Check homogeneity and completeness separately before trusting the combined score.

  • Choose \(\beta\) based on what errors matter:

    • if splitting a class is very bad → use larger \(\beta\)

    • if mixing classes is very bad → use smaller \(\beta\)

  • If you do not have labels, V-measure cannot be computed; use internal metrics.
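To act on the \(\beta\) advice above, v_measure_score accepts a beta argument, so the same prediction can be scored under different error priorities (toy arrays for illustration):

```python
from sklearn.metrics import v_measure_score

labels_true = [0, 0, 0, 1, 1, 1]
split = [0, 0, 1, 2, 2, 3]  # pure clusters, but each class is split in two

# Larger beta weights completeness more, so the split is punished harder
for beta in (0.5, 1.0, 2.0):
    print(beta, round(v_measure_score(labels_true, split, beta=beta), 3))
```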

Exercises#

  1. Increase the number of clusters in the random case and see how homogeneity changes.

  2. Create an imbalanced dataset (one class much larger) and see how the score behaves.

  3. Compare V-measure to adjusted mutual information (AMI) and adjusted Rand index (ARI).

  4. Change \(\beta\) and pick the \(k\) that maximizes \(V_{\beta}\) in the K-means section.

References#

  • Rosenberg & Hirschberg (2007): V-measure: A conditional entropy-based external cluster evaluation measure

  • scikit-learn docs: https://scikit-learn.org/stable/modules/clustering.html#homogeneity-completeness-and-v-measure

  • API: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.v_measure_score.html